gh-150424: Widen specialized int fast paths to full int64 range#150425

Draft
KRRT7 wants to merge 25 commits into
python:mainfrom
KRRT7:jit-wide-int-fastpath

Conversation

@KRRT7

@KRRT7 KRRT7 commented May 25, 2026

Fixes gh-150424.

This PR builds on the initial widening work by splitting the specialised `int` opcodes into two tiers:

  • Compact (`BINARY_OP_{ADD,SUBTRACT,MULTIPLY}_INT`): operands are already in CPython's 30-bit compact range — semantically identical to `main`, no additional overhead.
  • Wide (`BINARY_OP_{ADD,SUBTRACT,MULTIPLY}_INT_WIDE`): new fast path for exact ints in the full `int64_t` range using checked 64-bit arithmetic.

Changes:

  • Split specialisation: `specialize.c` prefers the compact variant when both operands are compact, falling back to the wide variant for larger int64-range values
  • `PyCompactLong{Add,Subtract,Multiply}`: return `PyStackRef_NULL` for overflow and OOM (matching `medium_from_stwodigits` in `main`), eliminating a dead `ERROR_IF` branch from every JIT-compiled compact trace
  • New `_wide_op_result(int64_t)` static inline inlines the small-int cache, single-digit freelist alloc, and `_PyLong_FromLarge` call, cutting the `PyLong_FromInt64` → `_PyLong_FromMedium` call chain for medium-range results
  • Tier-2 optimizer: `_GUARD_TOS/NOS_INT` substitutes `_GUARD_TOS/NOS_COMPACT` when the type is already known to be `PyLong_Type`; new `_GUARD_TOS/NOS_INT_WIDE` substitutes `_GUARD_TOS/NOS_OVERFLOWED` under the same condition
  • Six new Tier-2 inplace ops (`BINARY_OP_{ADD,SUBTRACT,MULTIPLY}_INT_WIDE_INPLACE` and `_RIGHT` variants)
  • Add regression coverage for compact/wide split, guard substitution, and overflow fallback
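
As a rough illustration of the two-tier split described above, the following Python sketch models which fast path a pair of exact ints would land on. It is a simplified model, not the PR's C code: `pick_tier` and its helpers are hypothetical names, and the 30-bit compact limit assumes a default (30-bit digit) build.

```python
# Simplified model of the compact/wide split; not CPython's actual code.
COMPACT_BITS = 30            # assumption: default 30-bit digit build
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def is_compact(v: int) -> bool:
    """True when |v| fits in a single 30-bit digit."""
    return -(2**COMPACT_BITS) < v < 2**COMPACT_BITS


def fits_int64(v: int) -> bool:
    """True when v is representable as a signed 64-bit integer."""
    return INT64_MIN <= v <= INT64_MAX


def pick_tier(a: int, b: int) -> str:
    """Hypothetical helper: the fast path a pair of exact ints would take."""
    if is_compact(a) and is_compact(b):
        return "compact"
    if fits_int64(a) and fits_int64(b):
        return "wide"
    return "generic"


print(pick_tier(1, 2))          # both operands are single-digit
print(pick_tier(2**32, 1))      # one operand needs the int64 path
print(pick_tier(1 << 64, 1))    # outside int64, stays on generic BINARY_OP
```

The result of the operation still has to be range-checked separately; this sketch only covers operand classification.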

Comparison: main (`bdf14dc`) vs this branch (`5e01840`).
Build: PGO + thin LTO, macOS arm64, Homebrew Clang 22.
Microbenchmarks target int arithmetic across compact and non-compact (int64) ranges.
pyperformance benchmarks are general workloads not specific to this change — included to confirm no regression.

Microbenchmarks (PYTHON_JIT=0)

Note: pyperf worker subprocesses exit before the JIT compilation threshold is reached; JIT=1 numbers are within noise of JIT=0 and not shown separately.

| Benchmark | main | branch | Change |
|---|---|---|---|
| int_small | 269 ns | | not significant |
| int_compact | 134 ns | | not significant |
| int_intermediate_overflow | 1.11 µs | 706 ns | 1.57x faster |
| int_double_add | 1.21 µs | 903 ns | 1.34x faster |
| int_accumulate | 667 ns | 517 ns | 1.29x faster |
| int_always_large | 435 ns | 328 ns | 1.33x faster |
| int_mixed | 155 ns | 118 ns | 1.32x faster |
| Geometric mean | | | 1.25x faster |
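
For readers who want to check the summary row, the geometric mean can be recomputed from the per-benchmark ratios. The sketch below assumes the two "not significant" rows count as 1.00x; that treatment is an assumption, though it does reproduce the reported figure.

```python
from statistics import geometric_mean

# Speedup ratios copied from the microbenchmark results above.
# Assumption: the two "not significant" rows are counted as 1.00x.
speedups = [1.00, 1.00, 1.57, 1.34, 1.29, 1.33, 1.32]

overall = geometric_mean(speedups)
print(f"{overall:.2f}x faster")
```

Dropping the two 1.00x rows would give a noticeably higher figure, so the summary appears to include every benchmark rather than only the ones that changed.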

pyperformance — interpreter only (PYTHON_JIT=0)

12 numeric/float-heavy benchmarks. Build: PGO + thin LTO, macOS arm64, Homebrew Clang 22.

| Benchmark | main | branch | Change | Significant |
|---|---|---|---|---|
| chaos | 27.2 ms | 27.5 ms | 1.01x slower | No |
| float | 35.6 ms | 35.5 ms | 1.00x faster | No |
| nbody | 51.1 ms | 50.4 ms | 1.01x faster | No |
| pidigits | 228 ms | 228 ms | 1.00x | No |
| pyflate | 212 ms | 218 ms | 1.03x slower | Yes |
| raytrace | 129 ms | 130 ms | 1.00x slower | No |
| scimark_fft | 134 ms | 136 ms | 1.01x slower | No |
| scimark_lu | 54.2 ms | 54.0 ms | 1.00x faster | No |
| scimark_monte_carlo | 31.7 ms | 31.8 ms | 1.00x slower | No |
| scimark_sor | 57.7 ms | 58.1 ms | 1.01x slower | No |
| scimark_sparse_mat_mult | 2.17 ms | 2.13 ms | 1.02x faster | No |
| spectral_norm | 47.4 ms | 46.8 ms | 1.01x faster | No |

pyperformance — JIT (PYTHON_JIT=1)

| Benchmark | main | branch | Change | Significant |
|---|---|---|---|---|
| chaos | 27.2 ms | 27.6 ms | 1.01x slower | No |
| float | 35.6 ms | 35.5 ms | 1.00x faster | No |
| nbody | 51.2 ms | 50.5 ms | 1.01x faster | No |
| pidigits | 228 ms | 228 ms | 1.00x | No |
| pyflate | 212 ms | 218 ms | 1.03x slower | Yes |
| raytrace | 130 ms | 130 ms | 1.00x slower | No |
| scimark_fft | 135 ms | 136 ms | 1.01x slower | No |
| scimark_lu | 54.1 ms | 54.0 ms | 1.00x faster | No |
| scimark_monte_carlo | 31.7 ms | 31.8 ms | 1.00x slower | No |
| scimark_sor | 57.5 ms | 58.1 ms | 1.01x slower | No |
| scimark_sparse_mat_mult | 2.17 ms | 2.25 ms | 1.04x slower | No |
| spectral_norm | 47.6 ms | 46.7 ms | 1.02x faster | Yes |

pyflate regression (1.03x slower, significant in both modes, t≈−16): likely caused by the `DEOPT_IF(!_PyLong_IsNonNegativeCompact(...))` guards added to `_BINARY_OP_SUBSCR_LIST_INT` / `_STORE_SUBSCR_LIST_INT` in the correctness fix — pyflate does heavy bit-pattern indexing through those bytecodes on every iteration. The guard is necessary for correctness but adds a branch on the hot path that wasn't there before.

KRRT7 added 4 commits May 25, 2026 12:52
Replace the 30-bit compact-only range check (is_medium_int) with
__builtin_add_overflow / sub_overflow / mul_overflow, widening
the JIT arithmetic fast path from ±2^30 to ±2^62.

Relax operand guards from _PyLong_CheckExactAndCompact to
PyLong_CheckExact so non-compact inputs also stay in the JIT trace.

Add inline _PyLong_AsInt64 for fast digit extraction from non-compact
PyLongs (avoids calling the heavy PyLong_AsLongLongAndOverflow).

Results (pyperf, ARM64, rigorous):
  intermediate_overflow  2.70x faster
  double_add             2.73x faster
  accumulate             1.65x faster
  geometric mean         1.52x faster

Includes pyperf-based microbenchmark (Tools/scripts/jit_int_benchmark_pyperf.py)
and a simpler timeit-based version (jit_int_benchmark.py).
@bedevere-app

bedevere-app Bot commented May 25, 2026

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@python-cla-bot

python-cla-bot Bot commented May 25, 2026

All commit authors signed the Contributor License Agreement.

CLA signed

@KRRT7 KRRT7 changed the title gh-150424: Fix widened JIT int fast paths gh-150424: Widen JIT int fast paths to full int64 range May 25, 2026
@picnixz
Member

picnixz commented May 25, 2026

15-bit builds

Do we actually support such builds?

@picnixz
Member

picnixz commented May 25, 2026

I suspect this has been entirely generated by an LLM (at least the summary has been generated by that). It's unclear what the issue is, so please first wait for some feedback on the issue before opening the PR. This is the common process as highlighted in our devguide.

@picnixz picnixz closed this May 25, 2026
@KRRT7
Author

KRRT7 commented May 25, 2026

I suspect this has been entirely generated by an LLM (at least the summary has been generated by that). It's unclear what the issue is, so please first wait for some feedback on the issue before opening the PR. This is the common process as highlighted in our devguide.

Hi, that's incorrect. I'm talking to a core dev directly on the PR and, as noted by the status, it's still a draft and WIP.

@KRRT7
Author

KRRT7 commented May 25, 2026

Do we actually support such builds?

15-bit builds are still supported: --enable-big-digits=15 is a documented configure option, and CPython still has both 15-bit and 30-bit PyLong layouts. I called it out because widening the JIT path to full int64_t exposed a real correctness issue on that

https://github.com/python/cpython/blob/main/Doc/using/configure.rst#L229-L237
https://github.com/python/cpython/blob/main/configure.ac#L6452-L6466
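
To see which digit layout a given interpreter was built with, `sys.int_info` can be inspected at runtime. The sketch below also shows why the layout matters for an int64 fast path: the same 64-bit magnitude spans a different number of digits on each build. `digits_for_magnitude` is a hypothetical helper written only for this illustration.

```python
import sys

# Digit width of this interpreter's int layout: 30 on default builds,
# 15 when CPython was configured with --enable-big-digits=15.
bits = sys.int_info.bits_per_digit


def digits_for_magnitude(nbits: int, digit_bits: int) -> int:
    """Hypothetical helper: digits needed to hold an nbits-wide magnitude."""
    return -(-nbits // digit_bits)  # ceiling division


# A full int64 magnitude (up to 2**63) needs 64 bits.
print("this build:", bits, "bits per digit")
print("15-bit layout:", digits_for_magnitude(64, 15), "digits")
print("30-bit layout:", digits_for_magnitude(64, 30), "digits")
```

Any code that reasons about digit counts therefore has to take the build's digit width into account rather than hard-coding the 30-bit case.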

@picnixz
Member

picnixz commented May 25, 2026

Hi, that's incorrect. I'm talking to a core dev directly on the PR and, as noted by the status

Who's the core dev?


Ok for the 15-bit but please, first share a reproducer on the issue before jumping onto PRs. I don't know if you want to address correctness or performance. Those are different. And the described scope is not helpful as there are too many points on the issue.

@KRRT7
Author

KRRT7 commented May 25, 2026

I don't know if you want to address correctness or performance.

I'm addressing performance; the performance changes surfaced correctness issues on 15-bit builds, which the test suite caught.

the described scope is not helpful as there are too many points on the issue.

I'm working on updating the title and body. As I said, this is a draft / WIP, and I haven't settled on the final body yet because I'm still doing some cleanup.

@picnixz
Member

picnixz commented May 25, 2026

Before doing any work, we should first understand the issue. It doesn't make sense to make PRs and update the issue accordingly. This is not how our workflow is. So please, first discuss the change on the issue by including a reproducer.

@KRRT7
Author

KRRT7 commented May 28, 2026

Before doing any work, we should first understand the issue.

I don't think what I'm doing is an "issue", though arguably performance bottlenecks are issues (which is something that I personally believe).

It doesn't make sense to make PRs and update the issue accordingly. This is not how our workflow is.

It is, however, the GitHub-native workflow (opening a rough PR as a draft, then iterating on and improving it), and GitHub is where CPython happens to be hosted.

@KRRT7 KRRT7 changed the title gh-150424: Widen JIT int fast paths to full int64 range gh-150424: Widen specialized int fast paths to full int64 range May 28, 2026
@picnixz
Member

picnixz commented May 28, 2026

It is, however, the GitHub-native workflow (opening a rough PR as a draft, then iterating on and improving it), and GitHub is where CPython happens to be hosted.

We have a policy written on the devguide. Contributors need to follow that policy. The purpose is to avoid back-and-forth so knowing first what the issue is about is the first step. PRs come after.

Nothing forbids you from opening something on your fork, though. But any change you make in a PR against CPython takes CI resources, which affects other contributors.

@picnixz picnixz reopened this May 28, 2026
@picnixz
Member

picnixz commented May 28, 2026

Considering @Fidget-Spinner's stance, I'm reopening it. But please follow what the devguide says. We have too many AI-generated PRs that take both triagers' and reviewers' time.

When it comes to performance PRs, we really want to see benchmarks before PRs, even if in a draft state. This gives more intuition on whether the change is worth it or not.

Contributor

@eendebakpt eendebakpt left a comment

Interesting approach! I left some comments for improvement. Could you update the benchmarks to include both the normal and the jit build?

Comment thread Python/bytecodes.c Outdated
Comment thread Objects/longobject.c
Comment thread Include/internal/pycore_long.h Outdated
Comment thread Include/internal/pycore_long.h
Comment thread Include/internal/pycore_long.h Outdated
Comment thread Tools/scripts/jit_int_benchmark_pyperf.py
@KRRT7
Author

KRRT7 commented May 29, 2026

@picnixz locally I saw a perf regression after applying my changes when responding to the review, so I made some other changes that restored the speedup for most cases except for int_small. I'm running pyperf now.

@KRRT7
Author

KRRT7 commented May 29, 2026

I've updated the benchmarks, with JIT on and off.

@KRRT7 KRRT7 marked this pull request as ready for review May 29, 2026 04:42
@KRRT7
Author

KRRT7 commented May 29, 2026

Follow-up from review:

  • Tightened _PyLong_MightFitInt64() so it rejects max-digit values outside the signed 64-bit range instead of using digit count alone.
  • Fixed tier-2 optimizer facts so _GUARD_*_INT_WIDE records PyLong_Type, not compact-int state.
  • Added regression coverage for 1 << 64 staying on generic BINARY_OP instead of specializing to *_INT_WIDE.
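
The first follow-up item can be illustrated with a small Python model of why a digit-count check alone is too loose. This is only a sketch assuming 30-bit digits; the function names below are hypothetical and are not the PR's `_PyLong_MightFitInt64`.

```python
DIGIT_BITS = 30                 # assumption: default 30-bit digit build
MAX_DIGITS_FOR_INT64 = 3        # ceil(64 / 30) under that assumption


def ndigits(v: int) -> int:
    """Number of 30-bit digits needed for |v| (at least one)."""
    return max(1, -(-abs(v).bit_length() // DIGIT_BITS))


def might_fit_by_digit_count(v: int) -> bool:
    """Loose check: only looks at how many digits the value has."""
    return ndigits(v) <= MAX_DIGITS_FOR_INT64


def fits_int64_exact(v: int) -> bool:
    """Tight check: the value really is a signed 64-bit integer."""
    return -(2**63) <= v <= 2**63 - 1


v = 1 << 64
print(ndigits(v), might_fit_by_digit_count(v), fits_int64_exact(v))
```

Three 30-bit digits can hold magnitudes up to `2**90 - 1`, so `1 << 64` passes the digit-count check while lying outside the signed 64-bit range; that is the gap a tightened check has to close.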

@eendebakpt
Contributor

I've updated the benchmarks, with JIT on and off.

Thanks! Results look good. I am wondering where the performance gain is coming from. I can think of at least three options:

  • Covering more ints in the specializing code, e.g. avoiding the slow path via __add__
  • Covering more ints, causing less deopts
  • The fast path for int64 (e.g. ndigits <= _PY_LONG_MAX_DIGITS_FOR_INT64 , the _PyLong_TryAsInt64Exact and _Py_i64_add_overflow combination)

(probably it is a combination of the above). If the fast path for int64 is not contributing to the performance, we could drop it to simplify the code, if it is a significant contribution we could first create a PR adding the _PyLong_TryAsInt64Exact to long_add (in that way we can split this PR in multiple smaller PRs, each easier to review)

I see in one of the recent commits you split _BINARY_OP_SUBTRACT_INT into _BINARY_OP_SUBTRACT_INT and _BINARY_OP_SUBTRACT_INT_WIDE. What was the reason for that? It would be nice to keep the number of opcodes small. Even stronger: could we just have a single _BINARY_OP_SUBTRACT_INT that covers all ints? (I suspect it would be good for the int paths, but maybe interacts poorly with float+int addition)

Could you explore the above? (in separate branches to keep this PR a bit stable)

And to make life of reviewers easier: either mark the PR as draft, or keep changes to a minimum. Reviewing a moving target is not easy

@KRRT7
Author

KRRT7 commented May 29, 2026

@picnixz I marked it as ready to review because I considered it ready to review, however once I did that, I noticed the CI started to fail which is why I made more changes, let me switch it back to draft, and after that, I'll only mark it as ready for review once we both agree it is in fact, ready for that?

@KRRT7 KRRT7 marked this pull request as draft May 29, 2026 08:27
Comment thread Objects/longobject.c
{
if (IS_SMALL_INT(x)) {
return PyStackRef_FromPyObjectBorrow(get_small_int((sdigit)x));
if ((b > 0 && a > INT64_MAX - b) || (b < 0 && a < INT64_MIN - b)) {
Contributor

There might be faster alternatives to this. See https://stackoverflow.com/a/44830670

(not sure this part is performance critical, and how often we branch to false/true though)

Author

I actually did attempt __builtin_add_overflow but I was having some issues with it, I believe it had to do with cache misses, I'll have to double check
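
For reference, the pre-check quoted in this thread decides overflow without ever computing `a + b`. The Python model below mirrors that condition; since Python ints never overflow, it only illustrates the logic, and `checked_add` is a hypothetical name rather than a function from the PR.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def add_overflows(a: int, b: int) -> bool:
    """Mirror of the C pre-check: True when a + b would leave int64."""
    return (b > 0 and a > INT64_MAX - b) or (b < 0 and a < INT64_MIN - b)


def checked_add(a: int, b: int):
    """Return a + b, or None when the sum does not fit in int64."""
    if add_overflows(a, b):
        return None
    return a + b


print(checked_add(2**32, 2**32 + 1))   # fits comfortably
print(checked_add(INT64_MAX, 1))       # None: positive overflow
print(checked_add(INT64_MIN, -1))      # None: negative overflow
```

In C, the `__builtin_add_overflow` builtin mentioned above (GCC and Clang) performs the addition and reports overflow in one step, which is the alternative being weighed against the two-comparison form.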

@picnixz
Member

picnixz commented May 29, 2026

I think you can put your PR up to review when it is ready and let Ken Jin shepherd it as he was supportive of the idea (if he wants to)

@KRRT7
Author

KRRT7 commented May 29, 2026

oops sorry, I tagged the wrong person here @picnixz , I meant to tag @eendebakpt , got my thoughts messed up, apologies!

@KRRT7
Author

KRRT7 commented May 29, 2026

@eendebakpt

I see in one of the recent commits you split _BINARY_OP_SUBTRACT_INT into _BINARY_OP_SUBTRACT_INT and _BINARY_OP_SUBTRACT_INT_WIDE. What was the reason for that? It would be nice to keep the number of opcodes small. Even stronger: could we just have a single _BINARY_OP_SUBTRACT_INT that covers all ints? (I suspect it would be good for the int paths, but maybe interacts poorly with float+int addition)

the compact and wide variants are separate opcodes so that the compact arithmetic path has no ERROR_IF branch in JIT traces. _PyCompactLong_Subtract (compact-only) returns PyStackRef_NULL for both overflow and OOM, so the compact uop body is just:

https://github.com/KRRT7/cpython/blob/4d495669265/Python/executor_cases.c.h#L4772-L4777

res = _PyCompactLong_Subtract(...);
EXIT_IF(PyStackRef_IsNull(res));

The wide variant (_PyCompactLong_SubtractWide) additionally has an ERROR_IF for the overflow case that compact arithmetic can never trigger:

https://github.com/KRRT7/cpython/blob/4d495669265/Python/executor_cases.c.h#L4938-L4955

res = _PyCompactLong_SubtractWide(...);
EXIT_IF(PyStackRef_IsNull(res));
// ...
ERROR_IF(PyStackRef_IsError(res));

If they were one op, every JIT trace for compact int subtraction would carry that dead ERROR_IF edge → larger traces → more icache pressure.

this was causing a perf regression in my micro benchmarks which is how I caught it

KRRT7 and others added 4 commits May 29, 2026 08:32
…rd substitution

- BINARY_OP_ADD_INT (tier1) handles all int64-range ints
- Removed all _WIDE tier1 opcodes (no dead code)
- Optimizer substitutes _GUARD_TOS_OVERFLOWED when type known as PyLong
- Removed _WIDE_INPLACE tier2 ops, merged inplace ops handle both compact and wide
- All tests pass
- _GUARD_NOS/TOS_INT use PyLong_CheckExact (accepts all exact ints)
- _BINARY_OP_{ADD,SUBTRACT,MULTIPLY}_INT removed BothAreCompact assert
- Pure ops have ERROR_IF for int64 overflow
- _PyCompactLong_{Add,Subtract,Multiply} branch internally on compactness
- Removed all remaining _WIDE dead code
- Regenerated case files
_GUARD_TOS_INT only checks PyLong_CheckExact, so wide (non-compact)
ints can reach _BINARY_OP_SUBSCR_LIST_INT and _STORE_SUBSCR_LIST_INT,
both of which call _PyLong_CompactValue unconditionally.  The
specializer correctly gates on _PyLong_IsNonNegativeCompact, but the
runtime guard did not enforce that invariant.

Add EXIT_IF / DEOPT_IF(!_PyLong_IsNonNegativeCompact) before each
_PyLong_CompactValue call so wide-int indices fall back to the slow
path instead of hitting the assertion.

The bug was latent before this branch because BINARY_OP_ADD_INT would
deopt on wide-int operands, so wide results rarely appeared as list
subscripts that had already been specialised for compact ints.
@KRRT7
Author

KRRT7 commented May 29, 2026

I see in one of the recent commits you split _BINARY_OP_SUBTRACT_INT into _BINARY_OP_SUBTRACT_INT and _BINARY_OP_SUBTRACT_INT_WIDE. What was the reason for that? It would be nice to keep the number of opcodes small. Even stronger: could we just have a single _BINARY_OP_SUBTRACT_INT that covers all ints?

I explored the merged approach you suggested and replaced the split compact/wide design with a single _BINARY_OP_{ADD,SUBTRACT,MULTIPLY}_INT that handles all int64-range ints. The _PyCompactLong_{Add,Subtract,Multiply} functions branch internally on compactness — fast compact path when both are single-digit, transparent int64 fallback otherwise. All _WIDE opcodes and their corresponding tier-1 macros have been removed.

also I am worried about the following scenario: we have a part of the code working with ints both compact (99%) and wide(1%)... it specializes on the compact ones... then a single wide int arrives and we deopt.

With the merged design this doesn't happen — the single guard (PyLong_CheckExact) accepts all exact ints, and the merged arithmetic handles both compact and wide transparently. I also benchmarked the scenario: the original split handled it fine too (side-exits are fast, no thrashing).

I am wondering where the performance gain is coming from.

I created experiment branches to decompose this. The gain is almost entirely from the custom inline int64 fast path (_PyLong_TryAsInt64Exact + _wide_op_result). Replacing those with standard PyLong_AsLongLongAndOverflow + PyLong_FromLongLong was 1.3x slower overall. The compact-only fast path contributes ~6%. The opcode split itself contributes <1%.

Benchmarks (PGO+LTO, clang/LLVM, Apple M-series) against origin/main:

Targeted wide-int microbenchmarks (operands in [2^30, 2^63), the range the fast path covers):

| Benchmark | main | branch | Speedup |
|---|---|---|---|
| wide_add (loop over `2**32 + k`) | 31.1 ms | 22.5 ms | 1.39x |
| wide_mul (loop over `2**32 * k`) | 37.7 ms | 27.7 ms | 1.36x |
| timestamp_arith (Unix timestamp ± day delta) | 29.9 ms | 22.1 ms | 1.36x |
| geometric mean | | | 1.37x faster |

Benchmark script
import pyperf

BASE = 2**32        # 4_294_967_296 — solidly in wide-int range
DELTA = 2**32 + 1

def bench_wide_add(loops):
    x = BASE
    y = DELTA
    for _ in range(loops):
        x = x + y - y
    return x

def bench_wide_mul(loops):
    x = BASE
    y = 3
    for _ in range(loops):
        x = x * y // y
    return x

TS_BASE = 1_748_000_000   # representative Unix timestamp (wide-int range)
TS_DELTA = 86400          # one day in seconds (compact)

def bench_timestamp_arith(loops):
    t = TS_BASE
    d = TS_DELTA
    for _ in range(loops):
        t = t + d - d
    return t

runner = pyperf.Runner()
runner.bench_func('wide_add',          bench_wide_add,        1_000_000)
runner.bench_func('wide_mul',          bench_wide_mul,        1_000_000)
runner.bench_func('timestamp_arith',   bench_timestamp_arith, 1_000_000)

General pyperformance benchmarks (crypto_pyaes, deltablue, go, hexiom, nqueens): no significant difference on any — all within 1–2% noise, as expected for workloads that don't produce wide-range integers.


One correctness fix also landed in this branch: _BINARY_OP_SUBSCR_LIST_INT and _STORE_SUBSCR_LIST_INT were calling _PyLong_CompactValue unconditionally, but their runtime guard (_GUARD_TOS_INT) only checks PyLong_CheckExact — so a wide int could pass the guard and hit the assertion. The specializer correctly gates on _PyLong_IsNonNegativeCompact at specialization time, but that invariant wasn't enforced at execution time. Fixed by adding EXIT_IF/DEOPT_IF(!_PyLong_IsNonNegativeCompact(...)) before each _PyLong_CompactValue call. The bug was latent before this branch because BINARY_OP_ADD_INT previously deopted on wide operands, so wide results rarely reached an already-specialised subscript bytecode.
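
The shape of that guard can be modelled in a few lines of Python. This is a behavioural sketch only: `subscr_list_int` is a hypothetical stand-in for the specialised bytecode, the `"fast"`/`"generic"` labels are invented for illustration, and the 30-bit limit assumes a default build.

```python
COMPACT_LIMIT = 2**30   # assumption: default 30-bit digit build


def is_nonnegative_compact(i: int) -> bool:
    """Model of the guard: index is a single-digit, non-negative int."""
    return 0 <= i < COMPACT_LIMIT


def subscr_list_int(lst: list, i: int):
    """Return (value, path taken) for lst[i]."""
    if not is_nonnegative_compact(i) or i >= len(lst):
        # Deopt: the generic path handles negative indices and
        # raises IndexError for out-of-range (including wide) ones.
        return lst[i], "generic"
    return lst[i], "fast"


items = [10, 20, 30]
print(subscr_list_int(items, 1))     # guard passes, fast path
print(subscr_list_int(items, -1))    # guard fails, generic path
try:
    subscr_list_int(items, 2**40)    # wide index never reaches the fast path
except IndexError as exc:
    print("IndexError:", exc)
```

The key point is ordering: the compactness test runs before anything that assumes a single-digit value, so a wide index falls back instead of tripping an assertion.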

Comment thread Include/internal/pycore_long.h Outdated
Development

Successfully merging this pull request may close these issues.

Widen specialized int fast paths to full int64 range

3 participants